Track, then Decide: Category-Agnostic Vision-based Multi-Object Tracking
The most common paradigm for vision-based multi-object tracking is
tracking-by-detection, due to the availability of reliable detectors for
several important object categories such as cars and pedestrians. However,
future mobile systems will need the capability to cope with rich human-made
environments, in which obtaining detectors for every possible object category
would be infeasible. In this paper, we propose a model-free multi-object
tracking approach that uses a category-agnostic image segmentation method to
track objects. We present an efficient segmentation mask-based tracker which
associates pixel-precise masks reported by the segmentation. Our approach can
utilize semantic information whenever it is available for classifying objects
at the track level, while retaining the capability to track generic unknown
objects in the absence of such information. We demonstrate experimentally that
our approach achieves performance comparable to state-of-the-art
tracking-by-detection methods for popular object categories such as cars and
pedestrians. Additionally, we show that the proposed method can discover and
robustly track a large variety of other objects.
Comment: ICRA'18 submission
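A minimal sketch of the kind of mask-based association such a tracker needs, matching pixel-precise masks across consecutive frames by mask IoU with Hungarian assignment (the function names, cost, and threshold are illustrative assumptions, not the paper's exact formulation):

import numpy as np
from scipy.optimize import linear_sum_assignment

def mask_iou(a, b):
    # IoU between two boolean masks of identical shape.
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union > 0 else 0.0

def associate_masks(prev_masks, curr_masks, iou_thresh=0.3):
    # Hungarian matching on (1 - IoU); returns matched (prev_idx, curr_idx) pairs.
    cost = np.ones((len(prev_masks), len(curr_masks)))
    for i, p in enumerate(prev_masks):
        for j, c in enumerate(curr_masks):
            cost[i, j] = 1.0 - mask_iou(p, c)
    rows, cols = linear_sum_assignment(cost)
    return [(i, j) for i, j in zip(rows, cols) if cost[i, j] <= 1.0 - iou_thresh]

Unmatched current masks would start new tracks, while unmatched previous masks would be kept alive or terminated by whatever track-management logic sits on top.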
PolarMOT: How Far Can Geometric Relations Take Us in 3D Multi-Object Tracking?
Most (3D) multi-object tracking methods rely on appearance-based cues for
data association. By contrast, we investigate how far we can get by only
encoding geometric relationships between objects in 3D space as cues for
data-driven data association. We encode 3D detections as nodes in a graph,
where spatial and temporal pairwise relations among objects are encoded via
localized polar coordinates on graph edges. This representation makes our
geometric relations invariant to global transformations and smooth trajectory
changes, especially under non-holonomic motion. This allows our graph neural
network to learn to effectively encode temporal and spatial interactions and
fully leverage contextual and motion cues to obtain final scene interpretation
by posing data association as edge classification. We establish a new
state-of-the-art on the nuScenes dataset and, more importantly, show that our
method, PolarMOT, generalizes remarkably well across different locations
(Boston, Singapore, Karlsruhe) and datasets (nuScenes and KITTI).
Comment: ECCV 2022, 17 pages, 5 pages of supplementary, 3 figures
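To make the edge representation concrete, here is a rough sketch of a localized polar edge feature between two detections, each carrying a 2D position, a heading, and a frame index; expressing the relation in the source node's frame is what provides invariance to global rotation and translation (the exact parameterization in the paper may differ):

import numpy as np

def polar_edge_feature(src, dst):
    # src, dst: dicts with 'xy' (np.ndarray of shape (2,)), 'yaw', and 't'.
    delta = dst['xy'] - src['xy']
    rng = np.linalg.norm(delta)                            # radial distance
    bearing = np.arctan2(delta[1], delta[0]) - src['yaw']  # angle in src frame
    rel_yaw = dst['yaw'] - src['yaw']                      # relative heading
    return np.array([rng, np.sin(bearing), np.cos(bearing),
                     np.sin(rel_yaw), np.cos(rel_yaw), dst['t'] - src['t']])

A graph neural network would then classify each edge as "same object" or not, which is the edge-classification view of data association the abstract describes.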
Walking Your LiDOG: A Journey Through Multiple Domains for LiDAR Semantic Segmentation
The ability to deploy robots that can operate safely in diverse environments
is crucial for developing embodied intelligent agents. As a community, we have
made tremendous progress in within-domain LiDAR semantic segmentation. However,
do these methods generalize across domains? To answer this question, we design
the first experimental setup for studying domain generalization (DG) for LiDAR
semantic segmentation (DG-LSS). Our results confirm a significant gap between
methods evaluated in a cross-domain setting: for example, a model trained on
the source dataset (SemanticKITTI) obtains a markedly lower mIoU on the target
data than the same model trained on the target domain (nuScenes). To tackle
this gap, we propose the first method specifically designed for DG-LSS, which
substantially improves mIoU on the target domain,
outperforming all baselines. Our method augments a sparse-convolutional
encoder-decoder 3D segmentation network with an additional, dense 2D
convolutional decoder that learns to classify a bird's-eye view of the point
cloud. This simple auxiliary task encourages the 3D network to learn features
that are robust to shifts in sensor placement and resolution, and transferable
across domains. With this work, we aim to inspire the community to develop and
evaluate future models in such cross-domain conditions.
Comment: Accepted at ICCV 2023
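A simplified sketch of the auxiliary dense BEV head the abstract describes: per-point features from the 3D network are scattered onto a bird's-eye-view grid and classified by a small 2D convolutional decoder, whose loss is added to the main 3D segmentation loss (class and argument names are assumptions; the paper's rasterization and decoder are richer):

import torch
import torch.nn as nn

class BEVAuxHead(nn.Module):
    def __init__(self, feat_dim, num_classes, grid_size=256):
        super().__init__()
        self.grid_size = grid_size
        self.decoder = nn.Sequential(
            nn.Conv2d(feat_dim, feat_dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_dim, num_classes, 1))

    def forward(self, points_xy, feats, xy_range=50.0):
        # points_xy: (N, 2) metric coordinates; feats: (N, C) point features.
        g = self.grid_size
        idx = ((points_xy + xy_range) / (2 * xy_range) * g).long().clamp(0, g - 1)
        bev = torch.zeros(feats.shape[1], g, g, device=feats.device)
        bev[:, idx[:, 1], idx[:, 0]] = feats.t()  # one point per cell, no pooling
        return self.decoder(bev.unsqueeze(0))     # (1, num_classes, g, g) logits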
Pix2Map: Cross-modal Retrieval for Inferring Street Maps from Images
Self-driving vehicles rely on urban street maps for autonomous navigation. In
this paper, we introduce Pix2Map, a method for inferring urban street map
topology directly from ego-view images, as needed to continually update and
expand existing maps. This is a challenging task, as we need to infer a complex
urban road topology directly from raw image data. The main insight of this
paper is that this problem can be posed as cross-modal retrieval by learning a
joint, cross-modal embedding space for images and existing maps, represented as
discrete graphs that encode the topological layout of the visual surroundings.
We conduct our experimental evaluation using the Argoverse dataset and show
that it is indeed possible to accurately retrieve street maps corresponding to
both seen and unseen roads solely from image data. Moreover, we show that our
retrieved maps can be used to update or expand existing maps and even show
proof-of-concept results for visual localization and image retrieval from
spatial graphs.
Comment: 12 pages, 8 figures
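Since the paper casts map inference as cross-modal retrieval, here is a sketch of the two core operations: a symmetric contrastive loss over paired (image, map-graph) embeddings, and cosine-similarity retrieval. The encoders themselves and all names below are assumptions:

import torch
import torch.nn.functional as F

def symmetric_contrastive_loss(img_embs, map_embs, temperature=0.07):
    # InfoNCE in both directions over a batch of paired embeddings (B, D).
    img = F.normalize(img_embs, dim=1)
    gma = F.normalize(map_embs, dim=1)
    logits = img @ gma.t() / temperature
    labels = torch.arange(len(img), device=img.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))

def retrieve_maps(image_emb, map_gallery):
    # Rank a gallery of map-graph embeddings (M, D) against one image embedding (D,).
    sims = F.cosine_similarity(image_emb.unsqueeze(0), map_gallery, dim=1)
    return sims.argsort(descending=True)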
STEm-Seg: Spatio-temporal Embeddings for Instance Segmentation in Videos
Existing methods for instance segmentation in videos typically involve
multi-stage pipelines that follow the tracking-by-detection paradigm and model
a video clip as a sequence of images. Multiple networks are used to detect
objects in individual frames, and then associate these detections over time.
Hence, these methods are often not end-to-end trainable and highly tailored to
specific tasks. In this paper, we propose a different approach that is
well-suited to a variety of tasks involving instance segmentation in videos.
In particular, we model a video clip as a single 3D spatio-temporal volume, and
propose a novel approach that segments and tracks instances across space and
time in a single stage. Our problem formulation is centered around the idea of
spatio-temporal embeddings which are trained to cluster pixels belonging to a
specific object instance over an entire video clip. To this end, we introduce
(i) novel mixing functions that enhance the feature representation of
spatio-temporal embeddings, and (ii) a single-stage, proposal-free network that
can reason about temporal context. Our network is trained end-to-end to learn
spatio-temporal embeddings as well as parameters required to cluster these
embeddings, thus simplifying inference. Our method achieves state-of-the-art
results across multiple datasets and tasks. Code and models are available at
https://github.com/sabarim/STEm-Seg.
Comment: 28 pages, 6 figures
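To make the clustering idea concrete, here is a greedy inference-time sketch of grouping spatio-temporal embeddings into instances with a fixed Gaussian margin; the actual method learns the bandwidth and seed map, so this simplified version is illustrative only:

import numpy as np

def cluster_embeddings(emb, seed_prob, sigma=1.0, fg_thresh=0.5):
    # emb: (T*H*W, D) per-pixel embeddings over the whole clip;
    # seed_prob: (T*H*W,) per-pixel foreground/seediness score.
    labels = np.full(len(emb), -1)           # -1 = background
    free = seed_prob > fg_thresh
    inst = 0
    while free.any():
        seed = int(np.argmax(np.where(free, seed_prob, -np.inf)))
        d2 = ((emb - emb[seed]) ** 2).sum(axis=1)
        member = free & (np.exp(-d2 / (2 * sigma ** 2)) > 0.5)
        labels[member] = inst                # instance id is valid across time
        free &= ~member
        inst += 1
    return labels

Because every pixel of the clip lives in one embedding space, a single clustering pass yields masks that are already linked over time, which is what removes the separate association stage.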
Making a Case for 3D Convolutions for Object Segmentation in Videos
The task of object segmentation in videos is usually accomplished by
processing appearance and motion information separately using standard 2D
convolutional networks, followed by a learned fusion of the two sources of
information. By contrast, 3D convolutional networks have been applied
successfully to video classification tasks, but have not been leveraged as
effectively for problems involving dense per-pixel interpretation of videos,
and they lag behind their 2D counterparts in performance. In this work, we show
that 3D
CNNs can be effectively applied to dense video prediction tasks such as salient
object segmentation. We propose a simple yet effective encoder-decoder network
architecture consisting entirely of 3D convolutions that can be trained
end-to-end using a standard cross-entropy loss. To this end, we leverage an
efficient 3D encoder and propose a 3D decoder architecture that comprises novel
3D Global Convolution layers and 3D Refinement modules. Our approach
outperforms the existing state of the art by a large margin on the DAVIS'16
Unsupervised, FBMS and ViSal benchmarks while also being faster, showing that
our architecture can efficiently learn expressive spatio-temporal features and
produce high-quality video segmentation masks. Our code and models will be made
publicly available.
Comment: BMVC '20
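A plausible minimal sketch of a 3D Global Convolution layer, factorizing a large kernel into three 1D convolutions along time, height, and width to obtain a large receptive field cheaply (an assumption about the design, not necessarily the paper's exact layer):

import torch.nn as nn

class GlobalConv3D(nn.Module):
    def __init__(self, in_ch, out_ch, k=7):
        super().__init__()
        p = k // 2
        self.conv = nn.Sequential(
            nn.Conv3d(in_ch, out_ch, (k, 1, 1), padding=(p, 0, 0)),
            nn.Conv3d(out_ch, out_ch, (1, k, 1), padding=(0, p, 0)),
            nn.Conv3d(out_ch, out_ch, (1, 1, k), padding=(0, 0, p)))

    def forward(self, x):
        # x: (B, C, T, H, W) video volume; output keeps the spatio-temporal shape.
        return self.conv(x)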
Lidar Panoptic Segmentation and Tracking without Bells and Whistles
State-of-the-art lidar panoptic segmentation (LPS) methods follow a bottom-up,
segmentation-centric approach, wherein they build upon semantic segmentation
networks by utilizing clustering to obtain object instances. In this paper, we
re-think this approach and propose a surprisingly simple yet effective
detection-centric network for both LPS and tracking. Our network is modular by
design and optimized for all aspects of both the panoptic segmentation and
tracking tasks. One of the core components of our network is the object instance
detection branch, which we train using point-level (modal) annotations, as
available in segmentation-centric datasets. In the absence of amodal (cuboid)
annotations, we regress modal centroids and object extent using
trajectory-level supervision that provides information about object size, which
cannot be inferred from single scans due to occlusions and the sparse nature of
the lidar data. We obtain fine-grained instance segments by learning to
associate lidar points with detected centroids. We evaluate our method on
several 3D/4D LPS benchmarks and observe that our model establishes a new
state-of-the-art among open-sourced models, outperforming recent query-based
models.
Comment: IROS 2023. Code at https://github.com/abhinavagarwalla/most-lp
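A compact sketch of the grouping step the abstract describes, in which each lidar point votes for an instance center via a regressed offset and is then assigned to the nearest detected centroid (names and the distance threshold are illustrative):

import numpy as np

def points_to_instances(points, offsets, centroids, max_dist=2.0):
    # points: (N, 3) lidar points; offsets: (N, 3) per-point center votes
    # from the regression branch; centroids: (M, 3) detected object centers.
    if len(centroids) == 0:
        return np.full(len(points), -1)
    votes = points + offsets
    d = np.linalg.norm(votes[:, None, :] - centroids[None, :, :], axis=2)
    nearest = d.argmin(axis=1)
    nearest[d.min(axis=1) > max_dist] = -1   # -1 = no instance (background/stuff)
    return nearest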
MOTChallenge: A Benchmark for Single-Camera Multiple Target Tracking
Standardized benchmarks have been crucial in pushing the performance of
computer vision algorithms, especially since the advent of deep learning.
Although leaderboards should not be over-claimed, they often provide the most
objective measure of performance and are therefore important guides for
research. We present MOTChallenge, a benchmark for single-camera Multiple
Object Tracking (MOT) launched in late 2014, to collect existing and new data,
and create a framework for the standardized evaluation of multiple object
tracking methods. The benchmark is focused on multiple people tracking, since
pedestrians are by far the most studied object in the tracking community, with
applications ranging from robot navigation to self-driving cars. This paper
collects the first three releases of the benchmark: (i) MOT15, along with
numerous state-of-the-art results that were submitted in the last years, (ii)
MOT16, which contains new challenging videos, and (iii) MOT17, which extends
MOT16 sequences with more precise labels and evaluates tracking performance on
three different object detectors. The second and third releases not only offer
a significant increase in the number of labeled boxes but also provide labels
for multiple object classes besides pedestrians, as well as the level of
visibility for every object of interest. Finally, we provide a
categorization of state-of-the-art trackers and a broad error analysis. This
will help newcomers understand the related work and research trends in the MOT
community, and hopefully shed some light on potential future research
directions.
Comment: Accepted at IJCV
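For reference, the benchmark's long-standing headline metric, CLEAR-MOT accuracy (MOTA), is computed directly from accumulated error counts:

def mota(num_fn, num_fp, num_idsw, num_gt):
    # Multiple Object Tracking Accuracy: 1 minus the ratio of false negatives,
    # false positives, and identity switches to the number of ground-truth boxes.
    return 1.0 - (num_fn + num_fp + num_idsw) / num_gt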